101 research outputs found

Exploiting Record Similarity for Practical Vertical Federated Learning

Authors: He Bingsheng; Li Qinbin; Wu Zhaomin
Publication date: 11/06/2021

As the privacy of machine learning has drawn increasing attention, federated learning has been introduced to enable collaborative learning without revealing raw data. Notably, vertical federated learning (VFL), where parties share the same set of samples but each holds only a subset of the features, has a wide range of real-world applications. However, existing VFL studies rarely examine the "record linkage" process. They either design algorithms assuming that the data from different parties have already been linked, or use simple linkage methods such as exact-linkage or top1-linkage. These approaches are unsuitable for many applications, such as those involving GPS locations or noisy titles, which require fuzzy matching. In this paper, we design a novel similarity-based VFL framework, FedSim, which is suitable for more real-world applications and achieves higher performance on traditional VFL tasks. Moreover, we theoretically analyze the privacy risk caused by sharing similarities. Our experiments on three synthetic datasets and five real-world datasets with various similarity metrics show that FedSim consistently outperforms other state-of-the-art baselines.

arXiv.org e-Print Archive

Privacy-Preserving Gradient Boosting Decision Trees

Authors: He Bingsheng; Li Qinbin; Wen Zeyi; Wu Zhaomin
Publication date: 28/10/2020

The Gradient Boosting Decision Tree (GBDT) has been a popular machine learning model for various tasks in recent years. In this paper, we study how to improve the model accuracy of GBDT while preserving the strong guarantee of differential privacy. Sensitivity and privacy budget are two key design aspects for the effectiveness of differentially private models. Existing solutions for GBDT with differential privacy suffer from significant accuracy loss due to overly loose sensitivity bounds and ineffective privacy budget allocations (especially across different trees in the GBDT model). Loose sensitivity bounds require more noise to achieve a fixed privacy level. Ineffective privacy budget allocations worsen the accuracy loss, especially when the number of trees is large. Therefore, we propose a new GBDT training algorithm that achieves tighter sensitivity bounds and more effective noise allocations. Specifically, by investigating the property of the gradient and the contribution of each tree in GBDTs, we propose adaptive control of the training-data gradients in each iteration, together with leaf node clipping, to tighten the sensitivity bounds. Furthermore, we design a novel boosting framework to allocate the privacy budget between trees so that the accuracy loss can be further reduced. Our experiments show that our approach can achieve much better model accuracy than other baselines.

arXiv.org e-Print Archive

OEBench: Investigating Open Environment Challenges in Real-World Relational Data Streams

Authors: Diao Yiqun; He Bingsheng; Li Qinbin; Lu Mian; Yang Yutong
Publication date: 03/09/2023

How to get insights from relational data streams in a timely manner is a hot research topic. This type of data stream can present unique challenges, such as distribution drifts, outliers, emerging classes, and changing features, which have recently been described as open environment challenges for machine learning. Although incremental learning for data streams has been studied, existing evaluations are mostly conducted on manually partitioned datasets. Thus, a natural question is what those open environment challenges look like in real-world relational data streams and how existing incremental learning algorithms perform on real datasets. To fill this gap, we develop an Open Environment Benchmark named OEBench to evaluate open environment challenges in relational data streams. Specifically, we investigate 55 real-world relational data streams and establish that open environment scenarios are indeed widespread in real-world datasets, which presents significant challenges for stream learning algorithms. Through benchmarks with existing incremental learning algorithms, we find that increased data quantity may not consistently enhance model accuracy in open environment scenarios, where machine learning models can be significantly compromised by missing values, distribution shifts, or anomalies in real-world data streams. Current techniques are insufficient to mitigate the challenges posed by open environments. More research is needed to address real-world open environment challenges. All datasets and code are open-sourced at https://github.com/sjtudyq/OEBench

arXiv.org e-Print Archive

Effective and Efficient Federated Tree Learning on Hybrid Data

Authors: He Bingsheng; Li Bo; Li Qinbin; Liu Xiaoyuan; Song Dawn; Xie Chulin; Xu Xiaojun; Zhang Ce
Publication date: 18/10/2023

Federated learning has emerged as a promising distributed learning paradigm that facilitates collaborative learning among multiple parties without transferring raw data. However, most existing federated learning studies focus on either horizontal or vertical data settings, where the data of different parties are assumed to be from the same feature or sample space. In practice, a common scenario is the hybrid data setting, where data from different parties may differ in both features and samples. To address this, we propose HybridTree, a novel federated learning approach that enables federated tree learning on hybrid data. We observe the existence of consistent split rules in trees. With the help of these split rules, we theoretically show that the knowledge of parties can be incorporated into the lower layers of a tree. Based on our theoretical analysis, we propose a layer-level solution that does not require frequent communication to train a tree. Our experiments demonstrate that HybridTree can achieve accuracy comparable to the centralized setting with low computational and communication overhead. HybridTree can achieve a speedup of up to 8 times compared with the other baselines.

arXiv.org e-Print Archive

Simulation of upper tropospheric CO₂ from chemistry and transport models

Authors: Chahine Moustafa T.; Chen Luke; Jiang Xun; Li Qinbin; Liang Mao-Chang; Olsen Edward T.; Shia Run-Lie; Yung Yuk L.
Publication venue: American Geophysical Union (AGU)
Publication date: 30/12/2008

The California Institute of Technology/Jet Propulsion Laboratory two-dimensional (2-D), three-dimensional (3-D) GEOS-Chem, and 3-D MOZART-2 chemistry and transport models (CTMs), driven respectively by NCEP2, GEOS-4, and NCEP1 reanalysis data, have been used to simulate upper tropospheric CO₂ from 2000 to 2004. Model results of CO₂ mixing ratios agree well with monthly mean aircraft observations at altitudes between 8 and 13 km (Matsueda et al., 2002) in the tropics. The upper tropospheric CO₂ seasonal cycle phases are well captured by the CTMs. Model results have smaller seasonal cycle amplitudes in the Southern Hemisphere than in the Northern Hemisphere, which is consistent with the aircraft data. Some discrepancies are evident between the model and aircraft data in the midlatitudes, where models tend to underestimate the amplitude of the CO₂ seasonal cycle. Comparison of the simulated vertical profiles of CO₂ between the different models reveals that the convection in the 3-D models is likely too weak in boreal winter and spring. Model sensitivity studies suggest that convection mass flux is important for the correct simulation of upper tropospheric CO₂.

Caltech Authors

CO₂ semiannual oscillation in the middle troposphere and at the surface

Authors: Chahine Moustafa T.; Chen Luke L.; Jiang Xun; Li Qinbin; Liang Maochang; Olsen Edward T.; Wang Jingqian; Yung Yuk L.
Publication venue: American Geophysical Union (AGU)
Publication date: 01/09/2012

Using in situ measurements, we find a semiannual oscillation (SAO) in midtropospheric and surface CO₂. Chemistry transport models (2-D Caltech/JPL model, 3-D GEOS-Chem, and 3-D MOZART-2) are used to investigate possible sources of the SAO signal in midtropospheric and surface CO₂. Model sensitivity studies reveal that the SAO signal in midtropospheric CO₂ originates mainly from surface CO₂, with a small contribution from transport fields. The source of the SAO signal in surface CO₂ is found to be mostly related to the CO₂ exchange between the biosphere and the atmosphere. By comparing model CO₂ with in situ CO₂ measurements at the surface, we find that models are able to capture both annual and semiannual cycles well at the surface. Model simulations of the annual and semiannual cycles of CO₂ in the tropical middle troposphere agree reasonably well with aircraft measurements.

Caltech Authors

Satellite remote sounding of mid-tropospheric CO₂

Authors: Chahine M. T.; Chen Luke; Dimotakis Paul; Jiang Xun; Li Qinbin; Olsen Edward T.; Pagano Thomas; Randerson James; Yung Yuk L.
Publication venue: American Geophysical Union (AGU)
Publication date: 01/09/2008

Human activity has increased the concentration of carbon dioxide in the Earth's atmosphere, which plays a direct role in contributing to global warming. Mid-tropospheric CO₂ retrieved by the Atmospheric Infrared Sounder shows substantial spatiotemporal variability that is supported by in situ aircraft measurements. The distribution of middle tropospheric CO₂ is strongly influenced by surface sources and large-scale circulations such as the mid-latitude jet streams and by synoptic weather systems, most notably in the summer hemisphere. In addition, the effects of stratosphere-troposphere exchange are observed during a final stratospheric warming event. The results provide the means to understand the sources and sinks and the lifting of CO₂ from surface layers into the free troposphere and its subsequent transport around the globe. These processes are not adequately represented in three chemistry-transport models that have been used to study carbon budgets.

eScholarship - University of California

Caltech Authors